Abstract

We study the task of cleaning scanned text documents 
that are strongly corrupted by dirt such as manual line 
strokes, spilled ink etc. We aim at autonomously remov- 
ing dirt from a single letter-size page based only on the 
information the page contains. Our approach, therefore, 
has to learn character representations without supervision 
and requires a mechanism to distinguish learned representations
from irregular patterns. To learn character representations,
we use a probabilistic generative model parameterizing
pattern features, feature variances, the features'
planar arrangements, and pattern frequencies. The latent
variables of the model describe pattern class, pattern posi- 
tion, and the presence or absence of individual pattern fea- 
tures. The model parameters are optimized using a novel 
variational EM approximation. After learning, the parame- 
ters represent, independently of their absolute position, pla- 
nar feature arrangements and their variances. A quality 
measure defined based on the learned representation then 
allows for an autonomous discrimination between regular 
character patterns and the irregular patterns making up the 
dirt. The irregular patterns can thus be removed to clean the 
document. For a full Latin alphabet we found that a single 
page does not contain sufficiently many character examples. 
However, we show that a page containing fewer character types
can, even if heavily corrupted by dirt, be cleaned efficiently
and autonomously based solely on the structural regularity of
the characters it contains. In different examples using
characters from different alphabets,
we demonstrate generality of the approach and discuss its 
implications for future developments. 

1. Introduction 

A basic form of human communication, written text, 
consists of planar arrangements of recurring and regular
patterns. While in modern forms of text these patterns
are characters or symbols for words (e.g., Chinese texts),
early forms consisted of symbols resembling objects. Writ- 
ten text became a successful form of communication be- 
cause it exploits the readily available capability of the hu- 
man visual system to learn and recognize regular patterns 
in visual data. In recent years, computer vision and ma- 
chine learning became increasingly successful in analyzing 
visual data. Much progress has been made, for instance, 
by probabilistic modeling approaches that aim at capturing 
the statistical regularities of a given data set. Examples are 
image denoising by Markov Random Fields [ ] or sparse 
coding models [ , ]. For many types of data, modeling 
approaches hereby have to address the problem that regular 
visual structures often appear at arbitrary positions. Sparse 
coding approaches indirectly address this problem by repli- 
cating a learned structure (e.g., a Gabor wavelet) at different 
positions of image patches. Other approaches go one step 
further and explicitly model pattern positions using addi- 
tional hidden variables [ , , 6, 7, 8, 9]. However, the com- 
binatorics of object identity and position introduces major 
challenges as for each pattern class all positions ideally have 
to be considered. 

In this paper we apply a probabilistic generative ap- 
proach with explicit position encoding to remove dirt from 
text documents. The principal idea is straightforward:
If characters are the salient regular patterns of text, an ap- 
propriately structured probabilistic model should be able to 
learn character representations as regular arrangements of 
features. In contrast, dirt is much more irregular. Coffee 
spots, spilled ink, or line-strokes scratching-out text share 
similar features with printed characters but such corruptions 
are, on average, much more random combinations of fea- 
ture patterns. Based on this observation, the autonomous 
identification and recovery of characters from a corrupted 
text document should thus be possible. But how difficult is 
such a task? Or how robust can a solution of such a task 
be if the data is heavily corrupted by dirt? Would the in- 
formation contained on a single page of a dirty document, 
for instance, be sufficient to identify the characters it
contains? And if yes, can this be used to 'self-clean' the
document? Such questions can, of course, not be answered 
by a clear 'yes' or 'no' because they will, e.g., depend on 
the type and degree of dirt or on the amount of available 
character information on a page. However, we will show 
that a self-cleaning of heavily corrupted documents is, in- 
deed, possible, e.g., for relatively low numbers of different 
character types. The only prerequisite will hereby be the 
characters' regular feature arrangements. No information 
about the characters has to be available, which makes the 
approach applicable to entirely unknown character types. 
The problem addressed here is thus very different from the 
one aimed at by optical character recognition (OCR) meth- 
ods that use supervised pretraining on known characters. In 
contrast, we require unsupervised methods to learn charac- 
ter representations. The generative model we apply is simi- 
lar to models suggested by Williams & Titsias [ ] and Jojic 
& Frey [ % 9, 10], which provide explicit representations 
of the data's regular patterns. As the data points we will 
have to process are image patches of corrupted text doc- 
uments, these previous models are not applicable because 
they require a static background, do not provide a mecha- 
nism to discriminate characters from irregular patterns, and 
are based on pixel image representations which can make 
learning less robust. In contrast, we (1) will have to al- 
low for varying fore- and background patterns (to take dirt 
into account), (2) will introduce a mechanism for character 
vs. dirt discrimination, and (3) will consider general feature 
vector representations of the data. Together with a novel 
non-greedy training scheme in the form of truncated vari- 
ational EM [11], the derived method will provide the re- 
quired robustness and efficiency for the task. 

2. A Generative Model for Characters

The probabilistic model we consider generates small image
patches of size $\vec{D} = (D_1, D_2)$. A pixel at position
$\vec{d}$ of the patch is represented by a feature vector
$\vec{y}_{\vec{d}}$ with $F$ entries. For now, $\vec{y}_{\vec{d}}$
can be thought of as a color vector at pixel position $\vec{d}$
in RGB space ($F = 3$). For the application to text documents,
however, we will later use more sophisticated features.

A patch $Y = (\vec{y}_{(1,1)}, \ldots, \vec{y}_{(D_1,D_2)})$ is
modeled to contain one pattern at an arbitrary position of the
patch (see Fig. 1a). For the class variable $c$ we use a standard
mixture model prior with $\vec{\pi} = (\pi_1, \ldots, \pi_C)$
denoting the mixing proportions and $C$ denoting the total
number of classes:

$$ p(c \mid \vec{\pi}) = \pi_c \quad \text{with} \quad \sum_{c=1}^{C} \pi_c = 1. \tag{1} $$

The position of the pattern in the patch, $\vec{x} \in \mathcal{D}$
with $\mathcal{D} = \{1, \ldots, D_1\} \times \{1, \ldots, D_2\}$,
is a 2D vector chosen from a uniform distribution over the
entire patch:

$$ p(\vec{x}) = p(x_1)\, p(x_2) = \mathrm{Uniform}(1, D_1) \times \mathrm{Uniform}(1, D_2) = \frac{1}{D_1 D_2}. \tag{2} $$




The shapes of different patterns are modeled by a set of
binary latent variables, namely the pattern mask
$\vec{m} = (m_{(1,1)}, \ldots, m_{(P_1,P_2)})$, where
$m_{\vec{i}} \in \{0, 1\}$. With $m_{\vec{i}} = 1$ the
corresponding feature is part of the pattern, while with
$m_{\vec{i}} = 0$ it is part of the background. The pattern size
$\vec{P} = (P_1, P_2)$ can be different from the image patch size,
$P_1 \le D_1$, $P_2 \le D_2$. Given the pattern class $c$, the
mask variables are drawn from Bernoulli distributions:

$$ p(\vec{m} \mid c, \mathcal{A}) = \prod_{\vec{i}=(1,1)}^{(P_1,P_2)} p(m_{\vec{i}} \mid c, \mathcal{A}) = \prod_{\vec{i}=(1,1)}^{(P_1,P_2)} (\alpha^c_{\vec{i}})^{m_{\vec{i}}}\, (1 - \alpha^c_{\vec{i}})^{1 - m_{\vec{i}}}, \tag{3} $$

where $\mathcal{A} = (A^1, \ldots, A^C)$ with
$A^c = (\alpha^c_{(1,1)}, \ldots, \alpha^c_{(P_1,P_2)})$ are the
parameters of the mask distribution. For the area where the image
patch is outside the pattern, the mask variables are always
assigned 0: $p(m_{\vec{i}} = 1 \mid c, \mathcal{A}) = p(m_{\vec{i}} = 1) = 0$,
$\forall\, \vec{i} \in \mathcal{D} \setminus \mathcal{P}$, with
$\mathcal{P} = \{1, \ldots, P_1\} \times \{1, \ldots, P_2\}$. From
the definition of masks, a background distribution is required
for all those features not belonging to a pattern
($m_{\vec{i}} = 0$). A possible choice is a flat Gaussian
distribution (compare [ ]). However, for data such as patches
from corrupted text documents, the distribution values are often
very different for the different feature vector entries, and
for the dirty background they are often observed to be
non-Gaussian. To appropriately model



the background features, we therefore construct a probability
density function by computing the histogram of the different
feature values across the image patches. The probability
densities for the individual feature vector entries are
modeled individually (see Fig. 2a for histograms of the R, G,
and B channels). The histograms are computed across all the
image patches, including the features that are potentially
later identified as being part of the learned patterns.
Nevertheless, the computed histograms are usually very similar
to the true background distributions (compare Fig. 2a). Once
computed, we therefore leave the histograms fixed throughout
learning. Having defined the background distribution
$\mathcal{H}_b$ and given pattern class $c$, mask $\vec{m}$,
and pattern position $\vec{x}$, the distribution of patch
features is given by:

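$$ p(Y \mid c, \vec{m}, \vec{x}, \Theta) = \prod_{\vec{i}=(1,1)}^{(D_1,D_2)} \mathcal{N}\big(\vec{y}_{\overrightarrow{(i+x)}};\, \vec{w}^c_{\vec{i}}, \Phi^c_{\vec{i}}\big)^{m_{\vec{i}}}\; \mathcal{H}_b\big(\vec{y}_{\overrightarrow{(i+x)}}\big)^{1 - m_{\vec{i}}}, \tag{4} $$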

where $\vec{w}^c_{\vec{i}}$ is the mean of a Gaussian distribution
and $\Phi^c_{\vec{i}}$ is the diagonal covariance matrix:
$\Phi^c_{\vec{i}} = \mathrm{diag}((\sigma^c_{\vec{i},1})^2, \ldots, (\sigma^c_{\vec{i},F})^2)$.
The mean $\vec{w}^c_{\vec{i}}$ parameterizes the mean feature
vector of pattern $c$ at position $\vec{i}$ relative to the
pattern position $\vec{x}$. The matrix $\Phi^c_{\vec{i}}$
parameterizes the feature vector variances (a different variance
per vector entry). The shift of a pattern $c$ is implemented by a
change of the position indices $\vec{i}$ by $\vec{x}$ using
cyclic boundary conditions:



$$ \overrightarrow{(i+x)} := \big( (i_1 + x_1) \bmod D_1,\; (i_2 + x_2) \bmod D_2 \big). \tag{5} $$



Equations 1 to 5 define the generative model for image
patches. The parameters of the model are given by
$\Theta = (W, \mathcal{A}, \vec{\pi})$ with $W = (W^1, \ldots, W^C)$ and
$W^c = (\vec{w}^c_{(1,1)}, \ldots, \vec{w}^c_{(P_1,P_2)})$, together
with the histograms for the background distribution. Fig. 1a
shows schematically how a patch is generated. First, a pattern
class is chosen (e.g., the class with pattern "B") and then the
mask variables $\vec{m}$ for the class are drawn (Eqn. 3). Pattern
parameters and mask are then translated by a random position
$\vec{x}$ before they are combined through a Gaussian
distribution for the pattern and the learned distribution for
the background (Eqn. 4).
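
The generative process of Eqs. 1 to 5 can be sketched in a few lines of Python. The sketch is an illustration under simplifying assumptions and not the implementation used in the experiments: features are scalars ($F = 1$), indices are zero-based, and the names `sample_patch` and `sample_background` are hypothetical.

```python
import random

def sample_patch(pi, alpha, w, sigma, D, sample_background, rng=random):
    """Draw one patch following Eqs. 1-5 (scalar features, zero-based indices).

    pi[c]              mixing proportions (Eq. 1)
    alpha[c][i][j]     mask probabilities of class c (Eq. 3)
    w[c][i][j]         pattern means; sigma[c][i][j] standard deviations (Eq. 4)
    D                  patch size (D1, D2); the pattern size is taken from w
    sample_background  hypothetical callable returning one background value
    """
    D1, D2 = D
    # Eq. 1: choose the pattern class from the mixture prior.
    c = rng.choices(range(len(pi)), weights=pi)[0]
    # Eq. 2: choose the pattern position uniformly over the patch.
    x = (rng.randrange(D1), rng.randrange(D2))
    # Start from a pure background patch (the mask is 0 outside the pattern).
    patch = [[sample_background() for _ in range(D2)] for _ in range(D1)]
    P1, P2 = len(w[c]), len(w[c][0])
    for i in range(P1):
        for j in range(P2):
            # Eq. 3: Bernoulli mask variable for pattern feature (i, j).
            if rng.random() < alpha[c][i][j]:
                # Eq. 5: cyclic shift of the pattern index by the position.
                d1, d2 = (i + x[0]) % D1, (j + x[1]) % D2
                # Eq. 4: Gaussian feature for pixels that belong to the pattern.
                patch[d1][d2] = rng.gauss(w[c][i][j], sigma[c][i][j])
    return c, x, patch
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible; a background sampler that draws from the fixed histograms would take the place of `sample_background`.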

3. Efficient Likelihood Maximization 

One approach to learning the parameters $\Theta$ from data
$Y = (Y^{(1)}, \ldots, Y^{(N)})$ is to maximize the data likelihood:



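$$ \Theta^{*} = \operatorname*{argmax}_{\Theta} \{ \mathcal{L}(\Theta) \}, \qquad \mathcal{L}(\Theta) = \log p(Y^{(1)}, \ldots, Y^{(N)} \mid \Theta). \tag{6} $$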



A frequently used method to find the parameters $\Theta^{*}$ is
Expectation Maximization (EM), which iteratively optimizes
a lower bound of the likelihood, $\mathcal{F}(\Theta, q)$, w.r.t.
the parameters $\Theta$ and a distribution $q$. With $\sum_{V}$
denoting a summation across the joint space of all hidden
variables $V = (c, \vec{m}, \vec{x})$ it is given by:

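$$ \mathcal{F}(\Theta, q) = \sum_{n=1}^{N} \sum_{V} q_n(V) \log p(Y^{(n)}, V \mid \Theta) \;-\; \sum_{n=1}^{N} \sum_{V} q_n(V) \log q_n(V). \tag{7} $$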



M-Step. Parameter update rules are canonically derived
by setting the derivatives of $\mathcal{F}$ w.r.t. the parameters to 0.

For the model (1) - (4), we obtain:



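$$
\begin{aligned}
\vec{w}^c_{\vec{i}} &= \frac{\sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x})\, p^{(n)}_{\Theta}(m_{\vec{i}} = 1 \mid c, \vec{x})\, \vec{y}^{(n)}_{\overrightarrow{(i+x)}}}{\sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x})\, p^{(n)}_{\Theta}(m_{\vec{i}} = 1 \mid c, \vec{x})}, \\
\Phi^c_{\vec{i}} &= \frac{\sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x})\, p^{(n)}_{\Theta}(m_{\vec{i}} = 1 \mid c, \vec{x})\, \big[ (\vec{y}^{(n)}_{\overrightarrow{(i+x)}} - \vec{w}^c_{\vec{i}}) (\vec{y}^{(n)}_{\overrightarrow{(i+x)}} - \vec{w}^c_{\vec{i}})^{T} \big] \odot \mathbb{1}}{\sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x})\, p^{(n)}_{\Theta}(m_{\vec{i}} = 1 \mid c, \vec{x})}, \\
\alpha^c_{\vec{i}} &= \frac{\sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x})\, p^{(n)}_{\Theta}(m_{\vec{i}} = 1 \mid c, \vec{x})}{\sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x})}, \\
\pi_c &= \frac{1}{N} \sum_{n=1}^{N} \sum_{\vec{x}} p^{(n)}_{\Theta}(c, \vec{x}),
\end{aligned}
\tag{8}
$$

where we abbreviated:

$p^{(n)}_{\Theta}(m_{\vec{i}} \mid c, \vec{x}) := p(m_{\vec{i}} \mid Y^{(n)}, c, \vec{x}, \Theta)$, $\quad p^{(n)}_{\Theta}(c, \vec{x}) := p(c, \vec{x} \mid Y^{(n)}, \Theta)$,

and where $\odot$ denotes pointwise matrix multiplication (in
this case with the unit matrix $\mathbb{1}$).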
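
A minimal sketch of such posterior-weighted updates, assuming the posteriors of the E-step are already available. To keep it short, the sketch uses one-dimensional patches with scalar features and zero-based indices; the function name `m_step` and the nested-list layout are hypothetical choices made for illustration only.

```python
def m_step(patches, post_cx, post_m, C, P):
    """One M-step for 1-D patches with scalar features (illustrative only).

    patches[n][d]       feature at position d of patch n
    post_cx[n][c][x]    p(c, x | Y^(n), Theta)
    post_m[n][c][x][i]  p(m_i = 1 | Y^(n), c, x, Theta)
    Returns (pi, alpha, w, var) as nested lists.
    """
    N, D = len(patches), len(patches[0])
    # Mixing proportions: average posterior mass of each class.
    pi = [sum(sum(post_cx[n][c]) for n in range(N)) / N for c in range(C)]
    alpha = [[0.0] * P for _ in range(C)]
    w = [[0.0] * P for _ in range(C)]
    var = [[0.0] * P for _ in range(C)]
    for c in range(C):
        # Posterior mass of class c summed over data points and positions.
        class_mass = sum(post_cx[n][c][x] for n in range(N) for x in range(D))
        for i in range(P):
            # Pairs of (weight, feature at the cyclically shifted index).
            terms = [(post_cx[n][c][x] * post_m[n][c][x][i],
                      patches[n][(i + x) % D])
                     for n in range(N) for x in range(D)]
            mass = sum(g for g, _ in terms)
            if class_mass > 0.0:
                alpha[c][i] = mass / class_mass
            if mass > 0.0:
                w[c][i] = sum(g * y for g, y in terms) / mass
                var[c][i] = sum(g * (y - w[c][i]) ** 2 for g, y in terms) / mass
    return pi, alpha, w, var
```

The guards against zero mass are a practical addition of this sketch; they leave a parameter at zero when no data point supports it.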

E-Step. The crucial and computationally expensive part of 
EM is the computation of the expectation values w.r.t. the 
posterior. For each data point, this involves summations of 
probabilities for all combinations of the hidden variables $c$,
$\vec{m}$ and $\vec{x}$. However, the summation over the latent
combinations can be simplified. By exploiting the standard
assumption of independent observed variables (compare, e.g.,
[ , ]) given the latents (see Eqn. 4), the posterior
distribution over $\vec{m}$ can be decomposed into the product of
the posteriors over individual binary masks as follows:

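$$ p(c, \vec{m}, \vec{x} \mid Y, \Theta) = \Big( \prod_{\vec{i}=(1,1)}^{(P_1,P_2)} p(m_{\vec{i}} \mid Y, c, \vec{x}, \Theta) \Big)\, p(c, \vec{x} \mid Y, \Theta). \tag{9} $$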

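The posterior distribution over individual binary masks can
then be computed as follows:

$$ p(m_{\vec{i}} \mid Y, c, \vec{x}, \Theta) = \frac{p(\vec{y}_{\overrightarrow{(i+x)}} \mid m_{\vec{i}}, c, \Theta)\, p(m_{\vec{i}} \mid c, \Theta)}{\sum_{m'_{\vec{i}}} p(\vec{y}_{\overrightarrow{(i+x)}} \mid m'_{\vec{i}}, c, \Theta)\, p(m'_{\vec{i}} \mid c, \Theta)}. \tag{10} $$

The summation in the denominator can be computed efficiently
as it only contains two cases: $m'_{\vec{i}} = 0$ and $m'_{\vec{i}} = 1$.
The posterior distribution over $c$ and $\vec{x}$ can be computed as
follows.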
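
The two-case normalization of the mask posterior can be sketched directly. The sketch assumes scalar features and a histogram-based background density as described in Sec. 2; `histogram_density`, `gaussian_pdf` and `mask_posterior` are hypothetical helper names, and the equal-width bin layout is an assumption.

```python
import math

def histogram_density(values, n_bins, lo, hi):
    """Piecewise-constant density estimated from `values`, assumed to lie in [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # put v == hi into the last bin
        counts[k] += 1
    total = len(values)

    def density(y):
        if y < lo or y > hi:
            return 0.0
        k = min(int((y - lo) / width), n_bins - 1)
        return counts[k] / (total * width)

    return density

def gaussian_pdf(y, mean, std):
    """Density of a univariate Gaussian."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mask_posterior(y, mean, std, alpha, background):
    """p(m = 1 | y): probability that the feature belongs to the pattern."""
    on = gaussian_pdf(y, mean, std) * alpha    # case m = 1 (pattern)
    off = background(y) * (1.0 - alpha)        # case m = 0 (background)
    total = on + off
    return on / total if total > 0.0 else 0.0
```

Because only two terms enter the denominator, the cost per feature is constant, which is why the mask variables need no further approximation.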



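$$ p(c, \vec{x} \mid Y, \Theta) \propto \prod_{\vec{i}=(1,1)}^{(P_1,P_2)} \Big( \sum_{m_{\vec{i}}} p(\vec{y}_{\overrightarrow{(i+x)}} \mid m_{\vec{i}}, c, \Theta)\, p(m_{\vec{i}} \mid c, \Theta) \Big) \cdot p(\vec{x} \mid \Theta)\, p(c \mid \Theta). \tag{11} $$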



With such a decomposition (compare [9, 8]), the computational
complexity decreases from exponential to polynomial, which
makes the computation tractable in principle. However, the
computational complexity, $O(C D_1 D_2 P_1 P_2)$, still grows
very fast with the size of patterns and patches. For realistic
image sizes (e.g., usually hundreds of thousands of pixels), it
still exceeds currently available computational resources. To
further improve efficiency, we therefore approximate the
computation of expectation values using variational EM (e.g.,
[ ]). The source of the high computational cost is the required
evaluation of all possible pattern positions for all classes.
To reduce the number of hidden states that have to be
evaluated, we apply a recent variational EM approach
(Expectation Truncation, [ ]) which is well suited for discrete
hidden variables. The approach used is not based on the usual
factored form of $q$ but on a truncated variational
approximation to the posterior. Applied to the posterior (11)
it is given by:

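$$ p(c, \vec{x} \mid Y^{(n)}, \Theta) \approx q_n(c, \vec{x}; \Theta) = \frac{p(c, \vec{x}, Y^{(n)} \mid \Theta)}{\sum_{(c', \vec{x}') \in \mathcal{K}_n} p(c', \vec{x}', Y^{(n)} \mid \Theta)}, \quad \forall (c, \vec{x}) \in \mathcal{K}_n, \tag{12} $$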



and zero otherwise. The variational distribution approximates
the true posterior with high precision if the set $\mathcal{K}_n$
contains those classes and positions that carry most posterior
mass for a given data point $Y^{(n)}$. In other words, for a
given patch we have to find the most likely pattern classes
together with their most likely patch positions in order to
obtain a high quality approximation. To achieve this we define
a function $S^{(n)}(c, \vec{x})$ that assigns a score to each
class and position pair $(c, \vec{x})$:

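$$ S^{(n)}(c, \vec{x}) = \prod_{\vec{i} \in \mathcal{P}'_c} \Big[ \mathcal{N}\big(\vec{y}^{(n)}_{\overrightarrow{(i+x)}};\, \vec{w}^c_{\vec{i}}, \Phi^c_{\vec{i}}\big)\, p(m_{\vec{i}} = 1 \mid c, \Theta) \Big], \tag{13} $$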



with $\mathcal{P}'_c \subseteq \mathcal{P}$. This scoring (or
selection) function (compare [ ]) gives high values to all
those positions that are consistent with features in the set
$\mathcal{P}'_c$. The set $\mathcal{P}'_c$ is in turn defined to
contain the $A$ most reliable features of pattern $c$. We define
these features as those with the highest mask parameters
$\alpha^c_{\vec{i}}$. A small value of $A$ results in a very
efficiently computable function $S^{(n)}(c, \vec{x})$. Based on
the selection function, we now define the set of most likely
class and position pairs to be:



$$ \mathcal{K}_n = \big\{ (c, \vec{x}) \;\big|\; (c, \vec{x}) \text{ has one of the } (K\, C\, D_1 D_2) \text{ largest values of } S^{(n)}(c, \vec{x}) \big\}, \tag{14} $$



where $K \in [0, 1]$ is the fraction of the joint space of all
classes and positions (size $C D_1 D_2$).
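
The selection of the most likely class and position pairs and the renormalization of the truncated distribution (12) can be sketched as follows. The dictionaries `scores` and `joint` are hypothetical inputs holding the selection scores and the joint probabilities for every pair $(c, \vec{x})$; rounding the set size up and keeping at least one pair are assumptions of this sketch.

```python
import heapq
import math

def select_states(scores, fraction):
    """Return the (c, x) pairs with the largest selection scores (cf. Eq. 14).

    scores    dict mapping (c, x) -> selection score for all classes and positions
    fraction  share of the joint class/position space to keep, in [0, 1]
    """
    size = max(1, math.ceil(fraction * len(scores)))  # assumption: keep >= 1 pair
    return heapq.nlargest(size, scores, key=scores.get)

def truncated_posterior(joint, selected):
    """Renormalize the joint p(c, x, Y | Theta) over the selected pairs (cf. Eq. 12)."""
    norm = sum(joint[s] for s in selected)
    return {s: joint[s] / norm for s in selected}
```

All pairs outside the selected set implicitly receive probability zero, so expectation values in the M-step only need to be summed over the returned dictionary.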

In principle, the approximation [ ] can also be used to 
constrain the number of states of mask variables. How- 
ever, the computational gain is negligible as the posterior 



w.r.t. the mask can be computed efficiently (10). For the
approximation, note that $A$ and $K$ parameterize the accuracy.
The higher $A$, the more reliable is the selection of considered
classes and positions, and the higher $K$, the larger is the
considered area of the joint class and position space. However,
the larger $A$ and $K$, the higher is the computational cost. For
the highest possible value of $A$ the selection becomes optimal
as $S^{(n)}(c, \vec{x})$ becomes proportional to
$p(c, \vec{x} \mid Y^{(n)}, \Theta)$ ($S^{(n)}(c, \vec{x})$ becomes
equal to $p(c, \vec{x} \mid Y^{(n)}, \Theta)\, p(Y^{(n)} \mid \Theta)$
with $p(Y^{(n)} \mid \Theta)$ being a constant for the selection).
For the highest possible value of $K$, $K = 1$, all positions
are considered and the variational distribution (12) becomes
equal to the exact posterior. In numerical experiments we found
approximations with high accuracy and simultaneously low
computational costs by choosing relatively low values of $A$
(e.g., $A = 200$ out of $P_1 P_2$ features) and relatively low
fractions of considered joint space (e.g., $K = 0.02$).
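
To give a rough feeling for the savings, the following back-of-the-envelope count combines the example values $A = 200$ and $K = 0.02$ with the settings reported for the scanned-document experiment in Sec. 4 ($C = 6$, $D = 40 \times 55$, $P = 30 \times 40$). Counting one unit per evaluated feature is a simplifying assumption of this sketch; constant factors and the cost of ranking the scores are ignored.

```python
C, D1, D2, P1, P2 = 6, 40, 55, 30, 40    # values from the scanned-document experiment
A, K = 200, 0.02                         # example settings quoted in the text

pairs = C * D1 * D2                      # size of the joint class/position space
exact_cost = pairs * P1 * P2             # every pattern feature for every (c, x) pair
selected = round(K * pairs)              # pairs kept in the truncated set
# Scoring touches A features per pair; the full posterior is only
# evaluated for the selected pairs.
truncated_cost = pairs * A + selected * P1 * P2

print(pairs, exact_cost, selected, truncated_cost)
print(round(exact_cost / truncated_cost, 1))
```

Under these assumptions the scoring pass dominates the truncated cost, so lowering $A$ has a larger effect on run time than lowering $K$.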

4. Learning and Identification of Characters 

Equations (8) to (14) define an approximate EM algorithm
to learn character representations. These representations
will be used to remove dirt from documents as described in
this section. Before doing so, we numerically evaluate the
learning procedure itself.

Artificial data. Let us first consider artificial images for 
which ground truth information is available. For the training 
data, we generated N = 1000 RGB image patches (F = 3) 
of size D = (50,50) according to the model (1) to (4). 
Each patch contained one of five different character types
with equal probability ($\pi_c = 0.2$). The chosen colored
characters were generated from corresponding mask, mean and
variance parameters (see Fig. 1a). The background color
was drawn from a Mixture of Gaussians as an example of
multi-modal distributions (see Fig. 1a). Fig. 2b shows a random
selection of 5 generated data points. The derived EM
learning algorithm was applied to the data assuming C = 5 
classes and $\vec{P} = \vec{D} = (50, 50)$. First, the background
histogram $\mathcal{H}_b$ was computed from the whole data set, and
was observed to model the true generating distributions with
high accuracy (the blue regions in Fig. 2a show the learned
histograms compared to the true distributions in red). To
infer the remaining model parameters they were first
initialized: the pattern mean $W$ was independently and
uniformly drawn from the RGB color cube $[0, 1]^3$; the pattern
variance $\Phi$ was set to the standard deviation of the data
set; and the initial mask parameters $\mathcal{A}$ were uniformly
drawn from the interval [0, 1]. The learning course of the
parameters is illustrated in Fig. 2c with iteration 0 showing
the initial values. After iteration 70, parameters had converged
sufficiently. To visualize pattern variances in Fig. 2c, they are
organized as a matrix for each pattern and each feature
dimension. The variance matrices are visualized by color images
which are normalized individually. As can be observed, the
algorithm successfully learned the model parameters. For the
experiment of Fig. 2 and other similar experiments, the learned
parameters deviated from the generating parameters by less than
3.0% on average. Convergence to local optima has
only been observed in very few cases (1 of 10 runs).

Figure 2. Experiment on artificial data. a The background for
data generation (red curves) and constructed histogram
$\mathcal{H}_b$ (blue regions). b 5 of $N = 1000$ image patches.
c The learning course of the parameters. $W$ is visualized in
RGB color space and $\sigma_R$, $\sigma_G$ and $\sigma_B$ are
visualized by heat maps.

Scanned text documents. Let us now apply the learning 
algorithm to data from a single page of a scanned text docu- 
ment. Consider the corrupted document displayed in Fig. 3e 
which contains 5 character types, "a", "b", "e", "s" and "y". 
The printed document was manually corrupted with dirt in 
the form of line strokes and with grayish spots. The dataset
for training was created by a high-resolution scan of the 
document (3307 x 4677 pixels) and by automatically cutting 
the scan into small patches (120 x 165 pixels) at fixed
intervals. Fig. 3a shows five examples of such patches. The
patches are used to generate the actual data points $Y^{(n)}$ with
vectorial features. Instead of RGB feature vectors as for the
introductory example, we used feature vectors generated
through Gabor filter responses. Gabor features are robust
and widespread in image processing (see, e.g., [ , ]) with
high sensitivity to edge-like structures and textures.
Furthermore, they are tolerant w.r.t. small local deformations
and brightness changes. For the small patches we computed
a Gabor feature vector with 40 entries at every third pixel,
which resulted in 2D arrays of $D_1 \times D_2 = 40 \times 55$
Gabor feature
vectors. The learning algorithm was applied to this data set
assuming C = 6 classes. The pattern mean W was initial- 
ized by randomly selecting C = 6 patches from the dataset 
and cutting out a segment of the pattern size at random po- 
sitions. The remaining parameters were initialized in the 
same way as for artificial data. To increase computational 
efficiency we, furthermore, assumed with P = (30, 40) a 
pattern size smaller than the patch size but still larger than 
the size of any characters. Parameter optimization (44 EM 
iterations) took about 25 minutes on a cluster with 15 GPUs 
(GTX 480). Fig. 3b visualizes the inferred parameters after 
the application of the learning algorithm (see Suppl. for a 
visualization of the time-course of learning). As can be ob- 
served, the algorithm has successfully represented the five 
character types. They were represented by different classes 
using parameters for mask, mean features and feature vari- 
ances. As only five classes are needed to represent all the 
characters, the algorithm has assigned a pattern averaging 
other patterns and dirt to one of the classes (class 4). In 
numerical experiments on this and other documents, classes 



not representing characters had either much lower values for 
learned mask parameters (compare Fig. 3b) or much lower 
values for learned mixing proportions $\pi_c$. We exploited this
observation to automatically classify character classes (see 
Suppl. for details). The full learning procedure then con- 
sisted of a repetition of the learning algorithm and a selec- 
tion of one of the results with the highest number of charac- 
ter classes. 

Character detection and identification. Based on the 
learned representation, characters in a given dirty document 
can now be detected and identified. We screen through 
the whole document from upper-left to lower-right patch by 
patch. Our aim is to identify a character within each patch 
$Y^{(n)}$ and to assign to each match a quality measure, i.e.,
a measure reporting how well each character matches the 
learned representation of its class. To identify the position 
and type of a character in a patch we compute the MAP 
estimate of the approximate posterior: 



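$$ (c^{*}, \vec{x}^{*}) = \operatorname*{argmax}_{c, \vec{x}} \{ p(c, \vec{x} \mid Y^{(n)}, \Theta) \} \approx \operatorname*{argmax}_{(c, \vec{x}) \in \mathcal{K}_n} \{ q_n(c, \vec{x}; \Theta) \}, \tag{15} $$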



with $q_n(c, \vec{x}; \Theta)$ and $\mathcal{K}_n$ defined as in
Sec. 3. In analogy to template matching ([ , ] and many more)
we refer to the result of the MAP estimate (15) as the match
for the image patch, to $\vec{x}^{*}$ as the matched position
and to $c^{*}$ as the matched class. Furthermore, given the
patch $Y^{(n)}$ with match $(c^{*}, \vec{x}^{*})$, we define the
quality of the match as follows.



(16) 



where $p(m_{\vec{i}} = 1 \mid Y^{(n)}, c^{*}, \vec{x}^{*}, \Theta)$
is the posterior distribution of the binary mask (see Eqn. 10).
The negative term in (16) is a normalized distance measure
between mask parameters and mask posterior probabilities. To
provide some intuition, suppose that the mask parameters are
binary, i.e., they are either maximally reliable
($\alpha^{c}_{\vec{i}} = 1$) or maximally unreliable
($\alpha^{c}_{\vec{i}} = 0$). Then, the quality reveals the
percentage of the pattern $c^{*}$ being matched in the patch.
If, for instance, a patch contains a complete and clean
instance of the pattern $c^{*}$ at position $\vec{x}^{*}$,
$p(m_{\vec{i}} = 1 \mid Y^{(n)}, c^{*}, \vec{x}^{*}, \Theta)$ is
close or equal to one for all reliable features and zero
otherwise. This implies that the distance measure is equal to
zero and $Q(Y^{(n)}, c^{*}, \vec{x}^{*}, \Theta)$ is equal to one.
For an appropriate scaling of the match quality with the degree
of dirt, the unreliable features in (16) have been
down-weighted by factors $(\alpha^{c}_{\vec{i}})^{\gamma}$ (we use
$\gamma = 10$). Given a patch $Y^{(n)}$ and a match
$(c^{*}, \vec{x}^{*})$, the measure (16) thus assigns a quality
value $Q(Y^{(n)}, c^{*}, \vec{x}^{*}, \Theta) \in [0, 1]$ which
reflects well the similarity between an input pattern at
$\vec{x}^{*}$ and its corresponding matched pattern of class
$c^{*}$. Low values of $Q$ correspond to poor matches and
$Q = 1$ corresponds to a perfect match (see Suppl. for details).

Figure 3. Experiment on text document. a 5 of $N = 1379$ image
patches. b The learned parameters with max representation (see
Suppl. for the full representation). c The clean representation
of each character type. d An illustration of the cleaning
procedure. e The cleaning result of our algorithm.
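
One measure with the properties described in the text can be sketched as follows. It is an illustration only and not necessarily identical to (16): the absolute difference as distance and the normalization by the summed weights are assumptions, and `match_quality` is a hypothetical name.

```python
def match_quality(alpha, mask_post, gamma=10):
    """Quality in [0, 1]: one minus a weighted, normalized distance between
    mask parameters and mask posteriors (an assumed form, see the lead-in).

    alpha[i]      learned mask parameters of the matched class
    mask_post[i]  p(m_i = 1 | Y, c*, x*, Theta) at the matched position
    """
    # Down-weight unreliable features: alpha ** gamma is close to 0 for small alpha.
    weights = [a ** gamma for a in alpha]
    norm = sum(weights)
    if norm == 0.0:
        return 0.0  # no reliable feature at all
    dist = sum(wt * abs(a - p) for wt, a, p in zip(weights, alpha, mask_post))
    return 1.0 - dist / norm
```

With binary mask parameters the returned value is the fraction of reliable pattern features that are found in the patch, which matches the intuition given in the text.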

5. Corrupted Document Cleaning

By making use of the learned character representation, 
character matching, and evaluation of match qualities, we 
can now remove dirt from a given corrupted scanned doc- 
ument. First, we use the match qualities to globally find 
the best matching input patterns for each class among all 
extracted patches. For each best input pattern we then com- 



pute a bounding box and store the corresponding pixel rep- 
resentation (see Fig. 3c). Then, using the best representa- 
tions, we can reconstruct the document (see Fig. 3e). In or- 
der to do so, we screen through the dirty document patch 
by patch and for each patch compute the match (c*, x*) us- 
ing (15) and the match quality using (16). If the matched 
position $\vec{x}^{*}$ corresponds to a pattern fully visible within
the patch, and if the match quality is above the threshold
$Q_0 = 0.5$, we paint the best representation of class $c^{*}$ at
position $\vec{x}^{*}$ onto an initially blank reconstructed document.
Fig. 3d illustrates this procedure for a small area of the ex- 
ample document. As can be observed, not all the matches 
are accepted for reconstruction, because some matches cor- 
respond to patterns not entirely visible (e.g., second patch 
at iteration 1) or match qualities are too low (e.g., last 
patch at iteration 2). The quality threshold prevents dirt 
from being reconstructed as characters. As for each patch 
just one match is computed, not all characters are recon- 
structed at first. For a complete reconstruction we there- 
fore replace each successfully reconstructed character in the 
original document by a blank rectangle (of the same size 
as the corresponding bounding box) and apply the proce- 
dure again. Patterns that previously were not identified be- 
cause of competition with other patterns can now be found 
and correctly reconstructed. We terminate the reconstruc- 
tion once no more matches are accepted. In Fig. 3d two 
iterations of the procedure are sufficient to successfully re- 
construct the word "bayes". The entire document in Fig. 3e 
is perfectly reconstructed after three iterations. The recon- 
structions of examples with more character types, non-Latin 
characters (Klingon) and random placement of characters 
show similar results (see Supplement). However, the more 
a document is corrupted by dirt, the less perfect we can ex- 
pect the reconstruction to be. In examples with dirt fully oc- 
cluding parts of the document, we do thus obtain many false 
negative errors (see Supplement). False positive errors are, 
on the other hand, obtained if, e.g., a random combination 
of manual line strokes coincides with the feature arrange- 
ment of a learned pattern (see Supplement). Although error 
rates for imperfect reconstructions can be decreased by
fine-tuning the threshold $Q_0$, we left the parameter unchanged
at $Q_0 = 0.5$ for all examples to demonstrate the generality
of the approach.
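
The iterative reconstruction can be summarized in a short sketch. It is a simplified illustration, not the procedure as implemented for the experiments: patches are treated as independent (no overlap), each accepted character is blanked immediately within its patch, and the callables `find_match` and `blank_out` are hypothetical stand-ins for the MAP match (15), the quality measure (16), and the blanking of a bounding box.

```python
def clean_document(patches, find_match, blank_out, q0=0.5):
    """Iteratively collect accepted character matches from corrupted patches.

    patches     mutable list of patches (any representation)
    find_match  hypothetical callable: patch -> (cls, pos, quality, fully_visible)
                or None if nothing can be matched
    blank_out   hypothetical callable: (patch, cls, pos) -> patch in which the
                matched character is replaced by a blank rectangle
    Returns the accepted matches as (patch_index, cls, pos) tuples.
    """
    accepted = []
    while True:
        accepted_this_round = 0
        for k, patch in enumerate(patches):
            match = find_match(patch)
            if match is None:
                continue
            cls, pos, quality, fully_visible = match
            # Accept only fully visible patterns above the quality threshold.
            if fully_visible and quality > q0:
                accepted.append((k, cls, pos))           # paint onto the clean page
                patches[k] = blank_out(patch, cls, pos)  # uncover competing patterns
                accepted_this_round += 1
        # Terminate once a full pass accepts no further match.
        if accepted_this_round == 0:
            return accepted
```

The loop relies on `blank_out` actually removing the accepted pattern; otherwise the same match would be accepted again in every pass.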

Note that the task of cleaning documents such as those 
in Fig. 3 or in the Supplement has previously not been ad- 
dressed. This is because of the difficulty posed by corrup- 
tions consisting partly of the same features as the characters 
(line strokes). Furthermore, extended line strokes severely 
affect any segmentation-based processing. It is in the nature 
of a new application domain that no data for comparison is 
available for our results. To provide, at least, a baseline, we 
applied a standard OCR approach (FineReader, [ ]) to the 
documents used in our experiments. For the document of 



Fig. 3, FineReader recognized 56.5% of the characters correctly
(essentially those that are segmentable), and corruption
by dirt caused 297 false positives. On the same data,
our approach detects 100% of the characters correctly with 
no false positives (FP). More examples can be found in the 
Supplement. The poorest performance of FineReader in all 
the examples is observed for documents with non-standard 
characters or unusual character orientations. For the documents
in Figs. 11 and 15 (Suppl. C.2 & C.3), for instance,
FineReader results in recognition rates of 0% (231 FP) and
0.8% (86 FP), respectively. For comparison, our approach
detects 100% (no FP) in Fig. 11 and 100% (3 FP) in Fig. 15.
Performance of the unsupervised learning algorithm is high 
in these latter two examples because it can learn any charac- 
ter type while the poor performance of FineReader is sim- 
ply evidence for the data containing characters unknown to 
the OCR approach. Improvements of OCR would require 
additional training on labeled data. However, as briefly dis- 
cussed in the introduction, note that a comparison of OCR 
to our approach on these data is not fair. OCR is not in- 
tended for the task addressed here. Vice versa, our algo- 
rithm would not perform well on typical OCR tasks. 

6. Discussion 

We have studied an unsupervised approach to remove 
dirt from scanned text documents. Our approach relied on 
the learning of character representations using a probabilis- 
tic generative model with an explicit position variable. Sim- 
ilar to other probabilistic approaches, e.g., image denois- 
ing, we followed the general principle of capturing the reg- 
ularities of the data, and removed unwanted data parts after 
identifying them as deviations from the learned regularities. 
However, in contrast to approaches for noise removal, we 
learned explicit high-level representations of specific image 
components (characters). Having an explicit notion of fea- 
ture arrangements per character allows for a discrimination 
of irregular patterns vs. characters even though these irregu- 
lar patterns can consist of the same features (line strokes) as 
the characters themselves. Methods not representing char- 
acters explicitly (e.g., [ ]) are, therefore, not applicable or 
would, at the least, require additional mechanisms to iden- 
tify characters and to discriminate them against irregular 
patterns. 

By applying our approach we have shown in this study 
that even under difficult conditions a perfect reconstruction 
of a document is possible with solely the information on a 
single page. The result of the cleaning procedure depends
on factors such as the severity of the corruption, the number
of character instances per character type, and on the similar- 
ity between character patterns and corrupting patterns. Very 
simple characters like "I", "V" or "C" are, for instance, eas- 
ier to confuse with random line strokes than more complex 
characters. Furthermore, the more character types a docu- 



ment contains the more challenging the discrimination be- 
tween characters becomes, especially for strongly corrupted 
data. This is true for learning as well as for character
identification. Regarding required data, we usually observed
good results in our experiments for more than 200 character
instances per character type. Performance decreased
significantly for fewer than 100 instances, primarily due to less
appropriate learning of the character representations. The 
example of Fig. 3e contains about 250 instances per char- 
acter type (1251 characters in total). A page with text con- 
sisting of the full alphabet of letters, even if constrained to 
just lower or upper case, would therefore not provide suffi- 
ciently many examples for self-cleaning. A natural exten- 
sion of the addressed task for more character types would, 
therefore, require several pages. If we assume that about 
200 examples per character type are needed and if a page 
contains 1000 characters in total, we would require about 
6 pages to learn a full Latin alphabet of lower-case letters. 
For the general type-set of all letters and numbers (exclud- 
ing special characters), we would require about 13 pages. If 
we, furthermore, consider that, e.g., just 0.074% of all char- 
acters in the English language are of type 'z' [ ], then the 
number of required pages would increase to about 270. To 
execute the cleaning procedure described in this work, pro- 
cessing of 270 pages amounts to unreasonably long compu- 
tation times (even using parallel implementations). 
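
The page estimates above follow from simple arithmetic, which the following sketch reproduces. The counts of 26 lower-case letters and of 62 letters and digits are our reading of "full Latin alphabet" and "all letters and numbers"; the function names are hypothetical.

```python
import math

EXAMPLES_PER_TYPE = 200   # instances needed per character type (from the text)
CHARS_PER_PAGE = 1000     # characters on one page (from the text)

def pages_uniform(n_types):
    """Pages needed if all character types were equally frequent."""
    return EXAMPLES_PER_TYPE * n_types / CHARS_PER_PAGE

def pages_for_rarest(frequency):
    """Pages needed until a type with the given relative frequency has enough examples."""
    return EXAMPLES_PER_TYPE / frequency / CHARS_PER_PAGE

print(math.ceil(pages_uniform(26)))            # lower-case Latin letters
print(math.ceil(pages_uniform(26 + 26 + 10)))  # letters of both cases and digits
print(round(pages_for_rarest(0.00074)))        # 'z' at 0.074% of all characters
```

The first two estimates are rounded up to whole pages, while the last one is rounded to the nearest page to match the approximate figure quoted in the text.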

On the other hand, the cleaning performance can be fur- 
ther improved by exploiting further regularities of text doc- 
uments. The regular arrangement of characters along a line 
(compare [ ]) could be used to predict the positions of 
characters, and linguistic regularities (e.g., probabilistic lan- 
guage models) could be used to predict character types from 
context. Using probabilistic generative approaches, such 
prior knowledge can be integrated into the model by constructing
more sophisticated prior distributions $p(c, \vec{x} \mid \Theta)$.
Also on the algorithmic side improvements can certainly be 
made, e.g., by using a multiple-cause structure (e.g., [1 ]) 
to recognize multiple patterns in a patch simultaneously, or 
by using image features with scale invariance and contrast 
normalization (e.g., SIFT [20], HOG [21]). Different font 
sizes of characters can be handled by modeling them as dif- 
ferent patterns, adding scaling transformations to the model 
(dramatically increasing the computational complexity), or 
estimating font sizes with separate mechanisms. 

By applying the probabilistic approach described in this 
work, we have for the first time shown that it is in princi- 
ple possible to autonomously clean text documents which 
are heavily corrupted by irregular patterns. Future devel- 
opments can further improve the cleaning performance by 
exploiting regularities of words and sentences, or they can 
extend the application domain of the approach. 